Add agent experience testing skill, expand .claude #561
muhsinking wants to merge 12 commits into main from
Conversation
runpod-Henrik
left a comment
PR #561 — Add agent experience testing framework, expand .claude
No prior reviews found — this is a first-time review.
1. MCP tool name typo — .claude/testing.md
Issue: Line 413 references mcp__runpod-dops__search_runpod_documentation. The MCP server is registered as runpod-docs, so the tool name should be mcp__runpod-docs__search_runpod_documentation. The misspelling will cause all Published Docs mode test runs to fail — the tool won't resolve.
2. Test table format doesn't match documented spec
Issue: tests/README.md and .claude/testing.md both say each test has three fields: ID, Goal, and Cleanup. The actual tables in TESTS.md have columns ID | Goal | Difficulty — no Cleanup column. An agent reading a test definition won't know what resources to clean up from the table; it has to infer from the bottom section. Either update the spec to say cleanup rules are global (not per-test), or add the Cleanup column back to the tables.
3. Port limit accuracy — runpodctl-create-pod.mdx
Question: The original said "Maximum of 1 HTTP port and 1 TCP port allowed." The new text says "up to 10 HTTP ports and multiple TCP ports." Is that backed by actual runpodctl behavior? If the original limit still applies to the CLI (even if the REST API allows more), this would mislead users into configurations that fail.
4. Framework vs catalog
This is a well-conceived idea but the implementation is a catalog, not a framework. A few structural gaps will limit how useful it is in practice:
No automation layer. Tests are triggered by a human typing natural language to Claude Code. There's no runner, no CI hook, no batch mode. 85 tests that require manual one-by-one triggering will never get run systematically.
Results are ephemeral. tests/reports/ is gitignored. There's no history, no trend tracking, no way to know which tests consistently fail across doc changes.
No smoke test tier. Many tests require live GPU deploys. There's no defined fast subset (10–15 tests) suitable for running before every merge. Without that, the full suite is too expensive to run regularly.
No success criteria. Difficulty (Easy/Hard) isn't a pass condition. The agent decides what PASS means, which will be inconsistent across runs. A brief expected outcome per test (e.g. "endpoint responds with 200 to a /runsync request") would anchor the verdict.
No cleanup safety net. If a test crashes mid-run, doc_test_ resources are orphaned. A cleanup script (e.g. delete all resources matching doc_test_*) would prevent cost surprises.
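To make the cleanup-script suggestion concrete, here is a minimal sketch. It assumes runpodctl's get pod and remove pod subcommands and a whitespace-separated listing with the pod ID in the first column and the name in the second; the actual output format should be checked before relying on this:

```python
import subprocess
import sys

PREFIX = "doc_test_"

def orphaned(lines: list[str]) -> list[tuple[str, str]]:
    """Pick (id, name) pairs whose name carries the test prefix."""
    hits = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].startswith(PREFIX):
            hits.append((parts[0], parts[1]))
    return hits

def main(delete: bool = False) -> None:
    # List pods, skip the header row, and act on anything test-prefixed.
    out = subprocess.run(
        ["runpodctl", "get", "pod"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pod_id, name in orphaned(out.splitlines()[1:]):
        if delete:
            subprocess.run(["runpodctl", "remove", "pod", pod_id], check=True)
            print(f"deleted {name} ({pod_id})")
        else:
            print(f"would delete {name} ({pod_id})  (dry run)")

if __name__ == "__main__":
    main(delete="--delete" in sys.argv)
```

Defaulting to dry-run keeps the script safe to run speculatively; only an explicit --delete flag destroys anything.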
The local-docs mode for pre-merge validation is the most immediately useful feature here. The published-docs batch testing vision is worth pursuing but needs the automation layer first.
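A machine-checkable expected outcome (point 4 above) could be as small as a helper that probes the deployed endpoint. This sketch assumes RunPod's serverless URL pattern (https://api.runpod.ai/v2/{endpoint_id}/runsync) and a bearer-token API key; the verdict helper is hypothetical and only anchors the PASS/FAIL vocabulary used in reports:

```python
import json
import urllib.error
import urllib.request

def runsync_ok(endpoint_id: str, api_key: str, payload: dict,
               timeout: int = 120) -> bool:
    """True when the endpoint answers a /runsync request with HTTP 200."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def verdict(ok: bool) -> str:
    """Map a mechanical check onto the report's PASS/FAIL vocabulary."""
    return "PASS" if ok else "FAIL"
```

Even one such check per test would make verdicts reproducible across runs instead of leaving PASS to the agent's judgment.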
Nits
- .gitignore is missing a trailing newline.
- Double --- separator before the Cleanup Rules section in TESTS.md looks like a copy-paste artifact.
Verdict
NEEDS WORK — The MCP tool name typo (#1) silently breaks Published Docs mode. The Cleanup column mismatch (#2) creates ambiguity for running agents. The port limit change (#3) needs factual verification. Section 4 is not a blocker but worth discussing before the suite grows further.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
runpod-Henrik
left a comment
Delta review — since 2026-03-20 13:52
1. Issue resolved — MCP tool name typo
Fixed prior to last review (per improvement plan). Confirmed: .claude/testing.md and .claude/commands/test.md both use mcp__runpod-docs__search_runpod_documentation.
2. Issue resolved — Test table format mismatch
Option B adopted: cleanup rules are global (defined once at the bottom), not per-test. Tables now have ID | Goal | Expected Outcome, with a callout pointing to the global Cleanup Rules section. Clear and consistent.
3. Issue resolved — Port limit accuracy
Verified against pods/configuration/expose-ports.mdx (confirmed 10 HTTP ports). The new text is accurate.
4. Structural gaps — status update
| Gap | Status |
|---|---|
| No smoke test tier | Resolved — 12 smoke tests added (no GPU deploys) |
| No success criteria | Resolved — Difficulty column replaced with Expected Outcome across all tables |
| No cleanup safety net | Resolved — cleanup.py added with dry-run and --delete flags |
| Ephemeral results | Partially addressed — dual-location saving (tests/reports/ + ~/Dev/doc-tests/), stats.py for trend tracking |
| No automation layer | Still deferred — acceptable for now |
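For the trend-tracking half of this, aggregation can stay very small. A sketch of what stats.py might do, assuming reports are markdown files containing a line like "Verdict: PASS" (the real report format may differ):

```python
from collections import Counter
from pathlib import Path

def tally_verdicts(report_texts: list[str]) -> Counter:
    """Count verdict lines (e.g. 'Verdict: PASS') across report texts."""
    counts: Counter = Counter()
    for text in report_texts:
        for line in text.splitlines():
            if line.startswith("Verdict:"):
                counts[line.split(":", 1)[1].strip()] += 1
    return counts

def tally_dir(report_dir: str) -> Counter:
    """Aggregate every markdown report in a directory."""
    texts = [p.read_text() for p in sorted(Path(report_dir).glob("*.md"))]
    return tally_verdicts(texts)
```

Run against ~/Dev/doc-tests/ over time, a tally like this is enough to spot tests that consistently fail across doc changes.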
5. Nits resolved
.gitignore trailing newline and double --- separator in TESTS.md both fixed.
6. New: batch mode — minor confirmation step inconsistency
.claude/commands/test.md specifies step 2 of batch execution as "Show test list — ask for confirmation before running." .claude/testing.md's batch section omits this step and goes straight to "Run sequentially." An agent following testing.md will run a full category silently without confirmation. Worth aligning — the confirmation step in commands/test.md is the safer behaviour.
Nits
- Category test counts in commands/test.md (e.g., serverless | 20, pods | 11) will silently drift as tests are added. Either remove the count column or note it's approximate. Low impact, but misleading once the table goes stale.
- IMPROVEMENT_PLAN.md removed — correct, all items tracked there are done.
Verdict
PASS — all blockers from prior reviews are resolved. The batch mode addition is well-structured and consistent across the three files that needed updating. One minor behaviour gap in batch confirmation, but not a blocker.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Agent experience testing skill
Summary
Adds a lightweight framework for testing documentation quality by having AI coding agents attempt real-world tasks using only the docs. Tests reveal documentation gaps by simulating what happens when a user asks "how do I deploy a vLLM endpoint?" without any prior context.
Philosophy
Tests are intentionally hard to pass. Each test is a single sentence—no hints, no steps, no doc references. If the docs are good, an agent can figure it out. If not, the test reveals exactly where users get stuck.
How it works
Tests are defined in tests/TESTS.md and run by asking Claude Code in natural language, e.g. "Run the vllm-deploy test".
Two doc source modes
"Run the vllm-deploy test" runs against the published docs; "Run the vllm-deploy test using local docs" switches to local mode, which reads .mdx files directly from the repo, letting you test doc changes on a branch before merging.
Test coverage
~85 tests across 13 product areas.
Test format
Tests are minimal by design: an ID, a one-sentence goal, and an expected outcome.
That's it. The agent must figure out everything else from the docs.
Report output
After each test, reports are saved to tests/reports/ (gitignored).
Files changed
Requirements
Safety
All resources created during tests use the doc_test_ prefix.